-
Notifications
You must be signed in to change notification settings - Fork 255
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
implements weighted shuffle using binary tree #185
implements weighted shuffle using binary tree #185
Conversation
e5d304a
to
9c3461f
Compare
9c3461f
to
9558d6d
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #185 +/- ##
=======================================
Coverage 81.9% 81.9%
=======================================
Files 837 837
Lines 226792 226842 +50
=======================================
+ Hits 185795 185851 +56
+ Misses 40997 40991 -6 |
8e8e949
to
52e26b9
Compare
This is partial port of firedancer's implementation of weighted shuffle: https://github.com/firedancer-io/firedancer/blob/3401bfc26/src/ballet/wsample/fd_wsample.c Though Fenwick trees use less space, inverse queries require an additional O(log n) factor for binary search resulting an overall O(n log n log n) performance for weighted shuffle. This commit instead uses a binary tree where each node contains the sum of all weights in its left sub-tree. The weights themselves are implicitly stored at the leaves. Inverse queries and updates to the tree all can be done O(log n) resulting an overall O(n log n) weighted shuffle implementation. Based on benchmarks, this results in 24% improvement in WeightedShuffle::shuffle: Fenwick tree: test bench_weighted_shuffle_new ... bench: 36,686 ns/iter (+/- 191) test bench_weighted_shuffle_shuffle ... bench: 342,625 ns/iter (+/- 4,067) Binary tree: test bench_weighted_shuffle_new ... bench: 59,131 ns/iter (+/- 362) test bench_weighted_shuffle_shuffle ... bench: 260,194 ns/iter (+/- 11,195) Though WeightedShuffle::new is now slower, it generally can be cached and reused as in Turbine: https://github.com/anza-xyz/agave/blob/b3fd87fe8/turbine/src/cluster_nodes.rs#L68 Additionally the new code has better asymptotic performance. For example with 20_000 weights WeightedShuffle::shuffle is 31% faster: Fenwick tree: test bench_weighted_shuffle_new ... bench: 255,071 ns/iter (+/- 9,591) test bench_weighted_shuffle_shuffle ... bench: 2,466,058 ns/iter (+/- 9,873) Binary tree: test bench_weighted_shuffle_new ... bench: 830,727 ns/iter (+/- 10,210) test bench_weighted_shuffle_shuffle ... bench: 1,696,160 ns/iter (+/- 75,271)
52e26b9
to
18531c9
Compare
This is partial port of firedancer's implementation of weighted shuffle: The rest is implemented in #259. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good to me. is it worth periodically caching the weighted shuffle in push_active_set.rs
since it takes longer to build? Although the shuffle
op more than makes up for the slower build.
I also tested a few other combinations on my end that worked as expected.
self.arr[k] -= weight; | ||
k += k & k.wrapping_neg(); | ||
fn remove(&mut self, k: usize, weight: T) { | ||
self.weight -= weight; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this result in a negative self.weight
? or an underflow if self.weight
is unsigned
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, unless there is a bug in this code.
self.weight
is meant to be sum of all weights.
This function remove
is only meant to be called when an item is sampled and its corresponding weight is removed from the tree. Because before sampling the item's weight is included in self.weight
, self.weight >= weight
.
while index < self.tree.len() { | ||
if val < self.tree[index] { | ||
weight = self.tree[index]; | ||
index = (index << 1) + 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
possible over flow here? but i the tree would have to be insanely big to have any issues. so should be fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
true
The following commit #259 will speed up |
This is port of firedancer's implementation of weighted shuffle: https://github.com/firedancer-io/firedancer/blob/3401bfc26/src/ballet/wsample/fd_wsample.c anza-xyz#185 implemented weighted shuffle using binary tree. Though asymptotically a binary tree has better performance, compared to a Fenwick tree, it is less cache local resulting in smaller improvements and in particular slower WeightedShuffle::new. In order to improve cache locality and reduce the overheads of traversing the tree, this commit instead uses a generalized N-ary tree with fanout of 16, showing significant improvements in both WeightedShuffle::new and WeightedShuffle::shuffle. With 4000 weights: N-ary tree (fanout 16): test bench_weighted_shuffle_new ... bench: 36,244 ns/iter (+/- 243) test bench_weighted_shuffle_shuffle ... bench: 149,082 ns/iter (+/- 1,474) Binary tree: test bench_weighted_shuffle_new ... bench: 58,514 ns/iter (+/- 229) test bench_weighted_shuffle_shuffle ... bench: 269,961 ns/iter (+/- 16,446) Fenwick tree: test bench_weighted_shuffle_new ... bench: 39,413 ns/iter (+/- 179) test bench_weighted_shuffle_shuffle ... bench: 364,771 ns/iter (+/- 2,078) The improvements become even more significant as there are more items to shuffle. With 20_000 weights: N-ary tree (fanout 16): test bench_weighted_shuffle_new ... bench: 200,659 ns/iter (+/- 4,395) test bench_weighted_shuffle_shuffle ... bench: 941,928 ns/iter (+/- 26,492) Binary tree: test bench_weighted_shuffle_new ... bench: 881,114 ns/iter (+/- 12,343) test bench_weighted_shuffle_shuffle ... bench: 1,822,257 ns/iter (+/- 12,772) Fenwick tree: test bench_weighted_shuffle_new ... bench: 276,936 ns/iter (+/- 14,692) test bench_weighted_shuffle_shuffle ... bench: 2,644,713 ns/iter (+/- 49,252)
This is port of firedancer's implementation of weighted shuffle: https://github.com/firedancer-io/firedancer/blob/3401bfc26/src/ballet/wsample/fd_wsample.c #185 implemented weighted shuffle using binary tree. Though asymptotically a binary tree has better performance, compared to a Fenwick tree, it has less cache locality resulting in smaller improvements and in particular slower WeightedShuffle::new. In order to improve cache locality and reduce the overheads of traversing the tree, this commit instead uses a generalized N-ary tree with fanout of 16, showing significant improvements in both WeightedShuffle::new and WeightedShuffle::shuffle. With 4000 weights: N-ary tree (fanout 16): test bench_weighted_shuffle_new ... bench: 36,244 ns/iter (+/- 243) test bench_weighted_shuffle_shuffle ... bench: 149,082 ns/iter (+/- 1,474) Binary tree: test bench_weighted_shuffle_new ... bench: 58,514 ns/iter (+/- 229) test bench_weighted_shuffle_shuffle ... bench: 269,961 ns/iter (+/- 16,446) Fenwick tree: test bench_weighted_shuffle_new ... bench: 39,413 ns/iter (+/- 179) test bench_weighted_shuffle_shuffle ... bench: 364,771 ns/iter (+/- 2,078) The improvements become even more significant as there are more items to shuffle. With 20_000 weights: N-ary tree (fanout 16): test bench_weighted_shuffle_new ... bench: 200,659 ns/iter (+/- 4,395) test bench_weighted_shuffle_shuffle ... bench: 941,928 ns/iter (+/- 26,492) Binary tree: test bench_weighted_shuffle_new ... bench: 881,114 ns/iter (+/- 12,343) test bench_weighted_shuffle_shuffle ... bench: 1,822,257 ns/iter (+/- 12,772) Fenwick tree: test bench_weighted_shuffle_new ... bench: 276,936 ns/iter (+/- 14,692) test bench_weighted_shuffle_shuffle ... bench: 2,644,713 ns/iter (+/- 49,252)
Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis. |
This is partial port of firedancer's implementation of weighted shuffle: https://github.com/firedancer-io/firedancer/blob/3401bfc26/src/ballet/wsample/fd_wsample.c Though Fenwick trees use less space, inverse queries require an additional O(log n) factor for binary search resulting an overall O(n log n log n) performance for weighted shuffle. This commit instead uses a binary tree where each node contains the sum of all weights in its left sub-tree. The weights themselves are implicitly stored at the leaves. Inverse queries and updates to the tree all can be done O(log n) resulting an overall O(n log n) weighted shuffle implementation. Based on benchmarks, this results in 24% improvement in WeightedShuffle::shuffle: Fenwick tree: test bench_weighted_shuffle_new ... bench: 36,686 ns/iter (+/- 191) test bench_weighted_shuffle_shuffle ... bench: 342,625 ns/iter (+/- 4,067) Binary tree: test bench_weighted_shuffle_new ... bench: 59,131 ns/iter (+/- 362) test bench_weighted_shuffle_shuffle ... bench: 260,194 ns/iter (+/- 11,195) Though WeightedShuffle::new is now slower, it generally can be cached and reused as in Turbine: https://github.com/anza-xyz/agave/blob/b3fd87fe8/turbine/src/cluster_nodes.rs#L68 Additionally the new code has better asymptotic performance. For example with 20_000 weights WeightedShuffle::shuffle is 31% faster: Fenwick tree: test bench_weighted_shuffle_new ... bench: 255,071 ns/iter (+/- 9,591) test bench_weighted_shuffle_shuffle ... bench: 2,466,058 ns/iter (+/- 9,873) Binary tree: test bench_weighted_shuffle_new ... bench: 830,727 ns/iter (+/- 10,210) test bench_weighted_shuffle_shuffle ... bench: 1,696,160 ns/iter (+/- 75,271) (cherry picked from commit b6d2237)
… (#425) implements weighted shuffle using binary tree (#185) This is partial port of firedancer's implementation of weighted shuffle: https://github.com/firedancer-io/firedancer/blob/3401bfc26/src/ballet/wsample/fd_wsample.c Though Fenwick trees use less space, inverse queries require an additional O(log n) factor for binary search resulting an overall O(n log n log n) performance for weighted shuffle. This commit instead uses a binary tree where each node contains the sum of all weights in its left sub-tree. The weights themselves are implicitly stored at the leaves. Inverse queries and updates to the tree all can be done O(log n) resulting an overall O(n log n) weighted shuffle implementation. Based on benchmarks, this results in 24% improvement in WeightedShuffle::shuffle: Fenwick tree: test bench_weighted_shuffle_new ... bench: 36,686 ns/iter (+/- 191) test bench_weighted_shuffle_shuffle ... bench: 342,625 ns/iter (+/- 4,067) Binary tree: test bench_weighted_shuffle_new ... bench: 59,131 ns/iter (+/- 362) test bench_weighted_shuffle_shuffle ... bench: 260,194 ns/iter (+/- 11,195) Though WeightedShuffle::new is now slower, it generally can be cached and reused as in Turbine: https://github.com/anza-xyz/agave/blob/b3fd87fe8/turbine/src/cluster_nodes.rs#L68 Additionally the new code has better asymptotic performance. For example with 20_000 weights WeightedShuffle::shuffle is 31% faster: Fenwick tree: test bench_weighted_shuffle_new ... bench: 255,071 ns/iter (+/- 9,591) test bench_weighted_shuffle_shuffle ... bench: 2,466,058 ns/iter (+/- 9,873) Binary tree: test bench_weighted_shuffle_new ... bench: 830,727 ns/iter (+/- 10,210) test bench_weighted_shuffle_shuffle ... bench: 1,696,160 ns/iter (+/- 75,271) (cherry picked from commit b6d2237) Co-authored-by: behzad nouri <[email protected]>
This is port of firedancer's implementation of weighted shuffle: https://github.com/firedancer-io/firedancer/blob/3401bfc26/src/ballet/wsample/fd_wsample.c #185 implemented weighted shuffle using binary tree. Though asymptotically a binary tree has better performance, compared to a Fenwick tree, it has less cache locality resulting in smaller improvements and in particular slower WeightedShuffle::new. In order to improve cache locality and reduce the overheads of traversing the tree, this commit instead uses a generalized N-ary tree with fanout of 16, showing significant improvements in both WeightedShuffle::new and WeightedShuffle::shuffle. With 4000 weights: N-ary tree (fanout 16): test bench_weighted_shuffle_new ... bench: 36,244 ns/iter (+/- 243) test bench_weighted_shuffle_shuffle ... bench: 149,082 ns/iter (+/- 1,474) Binary tree: test bench_weighted_shuffle_new ... bench: 58,514 ns/iter (+/- 229) test bench_weighted_shuffle_shuffle ... bench: 269,961 ns/iter (+/- 16,446) Fenwick tree: test bench_weighted_shuffle_new ... bench: 39,413 ns/iter (+/- 179) test bench_weighted_shuffle_shuffle ... bench: 364,771 ns/iter (+/- 2,078) The improvements become even more significant as there are more items to shuffle. With 20_000 weights: N-ary tree (fanout 16): test bench_weighted_shuffle_new ... bench: 200,659 ns/iter (+/- 4,395) test bench_weighted_shuffle_shuffle ... bench: 941,928 ns/iter (+/- 26,492) Binary tree: test bench_weighted_shuffle_new ... bench: 881,114 ns/iter (+/- 12,343) test bench_weighted_shuffle_shuffle ... bench: 1,822,257 ns/iter (+/- 12,772) Fenwick tree: test bench_weighted_shuffle_new ... bench: 276,936 ns/iter (+/- 14,692) test bench_weighted_shuffle_shuffle ... bench: 2,644,713 ns/iter (+/- 49,252) (cherry picked from commit 30eecd6)
#429) implements weighted shuffle using N-ary tree (#259) This is port of firedancer's implementation of weighted shuffle: https://github.com/firedancer-io/firedancer/blob/3401bfc26/src/ballet/wsample/fd_wsample.c #185 implemented weighted shuffle using binary tree. Though asymptotically a binary tree has better performance, compared to a Fenwick tree, it has less cache locality resulting in smaller improvements and in particular slower WeightedShuffle::new. In order to improve cache locality and reduce the overheads of traversing the tree, this commit instead uses a generalized N-ary tree with fanout of 16, showing significant improvements in both WeightedShuffle::new and WeightedShuffle::shuffle. With 4000 weights: N-ary tree (fanout 16): test bench_weighted_shuffle_new ... bench: 36,244 ns/iter (+/- 243) test bench_weighted_shuffle_shuffle ... bench: 149,082 ns/iter (+/- 1,474) Binary tree: test bench_weighted_shuffle_new ... bench: 58,514 ns/iter (+/- 229) test bench_weighted_shuffle_shuffle ... bench: 269,961 ns/iter (+/- 16,446) Fenwick tree: test bench_weighted_shuffle_new ... bench: 39,413 ns/iter (+/- 179) test bench_weighted_shuffle_shuffle ... bench: 364,771 ns/iter (+/- 2,078) The improvements become even more significant as there are more items to shuffle. With 20_000 weights: N-ary tree (fanout 16): test bench_weighted_shuffle_new ... bench: 200,659 ns/iter (+/- 4,395) test bench_weighted_shuffle_shuffle ... bench: 941,928 ns/iter (+/- 26,492) Binary tree: test bench_weighted_shuffle_new ... bench: 881,114 ns/iter (+/- 12,343) test bench_weighted_shuffle_shuffle ... bench: 1,822,257 ns/iter (+/- 12,772) Fenwick tree: test bench_weighted_shuffle_new ... bench: 276,936 ns/iter (+/- 14,692) test bench_weighted_shuffle_shuffle ... bench: 2,644,713 ns/iter (+/- 49,252) (cherry picked from commit 30eecd6) Co-authored-by: behzad nouri <[email protected]>
#429) implements weighted shuffle using N-ary tree (#259) This is port of firedancer's implementation of weighted shuffle: https://github.com/firedancer-io/firedancer/blob/3401bfc26/src/ballet/wsample/fd_wsample.c anza-xyz#185 implemented weighted shuffle using binary tree. Though asymptotically a binary tree has better performance, compared to a Fenwick tree, it has less cache locality resulting in smaller improvements and in particular slower WeightedShuffle::new. In order to improve cache locality and reduce the overheads of traversing the tree, this commit instead uses a generalized N-ary tree with fanout of 16, showing significant improvements in both WeightedShuffle::new and WeightedShuffle::shuffle. With 4000 weights: N-ary tree (fanout 16): test bench_weighted_shuffle_new ... bench: 36,244 ns/iter (+/- 243) test bench_weighted_shuffle_shuffle ... bench: 149,082 ns/iter (+/- 1,474) Binary tree: test bench_weighted_shuffle_new ... bench: 58,514 ns/iter (+/- 229) test bench_weighted_shuffle_shuffle ... bench: 269,961 ns/iter (+/- 16,446) Fenwick tree: test bench_weighted_shuffle_new ... bench: 39,413 ns/iter (+/- 179) test bench_weighted_shuffle_shuffle ... bench: 364,771 ns/iter (+/- 2,078) The improvements become even more significant as there are more items to shuffle. With 20_000 weights: N-ary tree (fanout 16): test bench_weighted_shuffle_new ... bench: 200,659 ns/iter (+/- 4,395) test bench_weighted_shuffle_shuffle ... bench: 941,928 ns/iter (+/- 26,492) Binary tree: test bench_weighted_shuffle_new ... bench: 881,114 ns/iter (+/- 12,343) test bench_weighted_shuffle_shuffle ... bench: 1,822,257 ns/iter (+/- 12,772) Fenwick tree: test bench_weighted_shuffle_new ... bench: 276,936 ns/iter (+/- 14,692) test bench_weighted_shuffle_shuffle ... bench: 2,644,713 ns/iter (+/- 49,252) (cherry picked from commit 30eecd6) Co-authored-by: behzad nouri <[email protected]>
…a-xyz#185) (anza-xyz#425) implements weighted shuffle using binary tree (anza-xyz#185) This is partial port of firedancer's implementation of weighted shuffle: https://github.com/firedancer-io/firedancer/blob/3401bfc26/src/ballet/wsample/fd_wsample.c Though Fenwick trees use less space, inverse queries require an additional O(log n) factor for binary search resulting an overall O(n log n log n) performance for weighted shuffle. This commit instead uses a binary tree where each node contains the sum of all weights in its left sub-tree. The weights themselves are implicitly stored at the leaves. Inverse queries and updates to the tree all can be done O(log n) resulting an overall O(n log n) weighted shuffle implementation. Based on benchmarks, this results in 24% improvement in WeightedShuffle::shuffle: Fenwick tree: test bench_weighted_shuffle_new ... bench: 36,686 ns/iter (+/- 191) test bench_weighted_shuffle_shuffle ... bench: 342,625 ns/iter (+/- 4,067) Binary tree: test bench_weighted_shuffle_new ... bench: 59,131 ns/iter (+/- 362) test bench_weighted_shuffle_shuffle ... bench: 260,194 ns/iter (+/- 11,195) Though WeightedShuffle::new is now slower, it generally can be cached and reused as in Turbine: https://github.com/anza-xyz/agave/blob/b3fd87fe8/turbine/src/cluster_nodes.rs#L68 Additionally the new code has better asymptotic performance. For example with 20_000 weights WeightedShuffle::shuffle is 31% faster: Fenwick tree: test bench_weighted_shuffle_new ... bench: 255,071 ns/iter (+/- 9,591) test bench_weighted_shuffle_shuffle ... bench: 2,466,058 ns/iter (+/- 9,873) Binary tree: test bench_weighted_shuffle_new ... bench: 830,727 ns/iter (+/- 10,210) test bench_weighted_shuffle_shuffle ... bench: 1,696,160 ns/iter (+/- 75,271) (cherry picked from commit b6d2237) Co-authored-by: behzad nouri <[email protected]>
This is partial port of firedancer's implementation of weighted shuffle:
https://github.com/firedancer-io/firedancer/blob/3401bfc26/src/ballet/wsample/fd_wsample.c
Problem
Though Fenwick trees use less space, inverse queries require an
additional
O(log n)
factor for binary search resulting an overallO(n log n log n)
performance for weighted shuffle.Summary of Changes
This commit instead uses a binary tree where each node contains the sum
of all weights in its left sub-tree. The weights themselves are
implicitly stored at the leaves. Inverse queries and updates to the tree
all can be done
O(log n)
resulting an overallO(n log n)
weightedshuffle implementation.
Based on benchmarks, this results in 24% improvement in
WeightedShuffle::shuffle
:Fenwick tree:
Binary tree:
Though
WeightedShuffle::new
is now slower, it generally can be cachedand reused as in Turbine:
https://github.com/anza-xyz/agave/blob/b3fd87fe8/turbine/src/cluster_nodes.rs#L68
Additionally the new code has better asymptotic performance. For
example with 20_000 weights
WeightedShuffle::shuffle
is 31% faster:Fenwick tree:
Binary tree: